-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cardinals up to a hundred trillions, timeFST and transliteration #209
Conversation
@tbartley94 The pull request is ready for review. |
def __init__(self): | ||
super().__init__(name="time", kind="classify") | ||
|
||
hours = pynini.string_map([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind moving this and the minutes into a data file? (It's easier to debug and maintain if we just keep it as text and read from there.)
('12', 'saa sita'), | ||
]) | ||
|
||
minutes = pynini.string_map([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
class VerbalizeTimeFst(GraphFst): | ||
def __init__(self): | ||
super().__init__(name="time",kind="verbalize") | ||
hour = (pynutil.delete("hours:")+delete_space+pynutil.delete("\"")+pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\"")+delete_space \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just do base closure since the hours:
property will limit potential tokens. I also believe it is more efficient since counting in FST graphs enlarges the graph (iirc).
def __init__(self): | ||
super().__init__(name="time",kind="verbalize") | ||
hour = (pynutil.delete("hours:")+delete_space+pynutil.delete("\"")+pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\"")+delete_space \ | ||
+pynutil.delete("minutes:")+delete_space+pynutil.delete("\"") + pynini.closure(NEMO_CHAR,1,60)+pynutil.delete("\"")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
+ delete_space | ||
+ pynutil.delete("}") | ||
) | ||
graph = delete_space + pynini.closure(graph + delete_extra_space) + graph + delete_space |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So heads up, this has a behavior that you can't separate normalization candidates with punctuation, there needs to be space or its a no-opt. This intended for y'alls downstream needs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give me an example of this? I don't fully get what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1,5 -> one,5
1, 5 -> one, five
It won't parse the comma as a class separator since delete_extra_space
requires at least one space.
from nemo_text_processing.text_normalization.en.graph_utils import GraphFst,NEMO_CHAR,insert_space | ||
from nemo_text_processing.text_normalization.rw.utils import get_abs_path | ||
|
||
def apply_fst(text, fst): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is unnecessary. the normalize class takes care of this with additional preprocessing. Please remove.
def __init__(self): | ||
super().__init__(name="cardinal", kind="classify") | ||
alphabet = string.ascii_letters | ||
rewrite_na_fst = pynini.cdrewrite(pynini.cross(" "," na "),pynini.union(*"aeiouAEIOU "),pynini.union(*"BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"),NEMO_CHAR.closure()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For line legitibility, mind just defining two string classes called vowels
and consonants
and inherit from your graph_fst class?
class CardinalFst(GraphFst): | ||
def __init__(self): | ||
super().__init__(name="cardinal", kind="classify") | ||
alphabet = string.ascii_letters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use NEMO_ALPHA since its already defined.
rewrite_na_fst = pynini.cdrewrite(pynini.cross(" "," na "),pynini.union(*"aeiouAEIOU "),pynini.union(*"BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"),NEMO_CHAR.closure()) | ||
rewrite_n_fst = pynini.cdrewrite(pynini.cross(" "," n'"),pynini.union(*"aeiouAEIOU "),pynini.union(*"aeiouAEIOU"),NEMO_CHAR.closure()) | ||
remove_underscore_fst = pynini.cdrewrite(pynini.cross("_"," "),pynini.union(*alphabet),pynini.union(*alphabet),NEMO_CHAR.closure()) | ||
remove_extra_space_fst = pynini.cdrewrite(pynini.cross(" "," "),pynini.union(*alphabet),pynini.union(*alphabet),NEMO_CHAR.closure()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use delete_extra_space
from graph utils.
("tiriyoni_mirongo_inani","8"), | ||
("tiriyoni_mirongo_icyenda","9") | ||
]) | ||
hundreds_of_trillions = pynini.string_map([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move all these string maps to a data folder so easier to maintain.
NINE_ZEROS = "000000000" | ||
|
||
zero = pynini.string_map([("zeru","0")]) | ||
rewrite_remove_comma_fst = pynini.cdrewrite(pynini.cross(",",""),pynini.union(*"0123456789"),pynini.union(*"0123456789"),NEMO_CHAR.closure()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use NEMO_DIGIT here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Major requests are just to move more of the string maps into the data
folder to make maintenance easier and suggestions on some FSTs with library variables for consistency. Elsewise things are looking good. One more PR and we can merge.
Great, thank you for the feedback, I will add the changes and get back. |
Let me know if you're running low on bandwidth. We can merge the simpler stuff and leave improvements for other PRs. |
Yes, we should do this. I fixed most of the issues you pointed out, and I can work on the rest on a future PR. |
return fst @ pynini.cdrewrite(pynini.cross(NEMO_SPACE, NEMO_NON_BREAKING_SPACE), "", "", NEMO_SIGMA) | ||
|
||
|
||
def string_map_cased(input_file: str, input_case: str = INPUT_LOWER_CASED): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add documentation, just make the flag a boolean to avoid the string matching
if input_case == INPUT_CASED: | ||
additional_labels = [] | ||
for written, spoken, *weight in labels: | ||
written_capitalized = written[0].upper() + written[1:] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just use capwords
https://docs.python.org/3/library/string.html#helper-functions.
written_capitalized = written[0].upper() + written[1:] | ||
additional_labels.extend( | ||
[ | ||
[written_capitalized, spoken.capitalize(),], # first letter capitalized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
spoken_no_space = spoken.replace(" ", "") | ||
# add abbreviations without spaces (both lower and upper case), i.e. "BMW" not "B M W" | ||
if len(spoken) == (2 * len(spoken_no_space) - 1): | ||
logger.debug(f"This is weight {weight}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need documentation for this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the function since it is not used in the Kinyarwanda text normalization; I believe the section you flagged was checking for abbreviations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And for the Jenkins issue, I rebased our branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@zoobereq can you assist in figuring out the jenkins issue? |
|
Please make sure to accept and implement the formatting changes pushed by In order to pass the CI pipeline ( We use Please let me know if you need further assistance with this PR. We will be happy review and merge it as soon as all checks are passed. |
Signed-off-by: kurt0cougar <[email protected]>
Signed-off-by: kurt0cougar <[email protected]>
…ing constants to data files). Signed-off-by: kurt0cougar <[email protected]>
Signed-off-by: kurt0cougar <[email protected]>
Signed-off-by: kurt0cougar <[email protected]>
Almost there. It's failing the formatting, which is easy to address. Either pull the changes generated by the |
…formatteda Signed-off-by: kurt0cougar <[email protected]>
Done. |
nemo_text_processing/text_normalization/rw/verbalizers/verbalize_final.py
Fixed
Show fixed
Hide fixed
The
|
Signed-off-by: kurt0cougar <[email protected]>
@zoobereq I'm feeling if CL tests pass we can just approve and then catch formatting things on our end. Make things easier. Mind giving a second set of eyes before we approve? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good on my end.
* Cardinals up to a hundred trillions, timeFST and transliteration Signed-off-by: kurt0cougar <[email protected]> * Cardinals up to a hundred trillions, timeFST and transliteration Signed-off-by: kurt0cougar <[email protected]> * Cardinals up to a hundred trillions, timeFST and transliteration (moving constants to data files). Signed-off-by: kurt0cougar <[email protected]> * Update test_cases_word.txt Signed-off-by: kurt0cougar <[email protected]> * Update graph_utils.py Signed-off-by: kurt0cougar <[email protected]> * Cardinals up to a hundred trillions, timeFST and transliteration - reformatteda Signed-off-by: kurt0cougar <[email protected]> * Disabled Black during formatting. Signed-off-by: kurt0cougar <[email protected]> --------- Signed-off-by: kurt0cougar <[email protected]> Signed-off-by: Alex Cui <[email protected]>
What does this PR do ?
Adds Kinyarwanda TN support for the following:
CARDINALS
semiotic class (up to hundred trillion)TIME
semiotic classBefore your PR is "Ready for review"
Pre checks:
git commit -s
to sign.pytest
or (if your machine does not have GPU)pytest --cpu
from the root folder (given you marked your test cases accordingly@pytest.mark.run_only_on('CPU')
).bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
pytest
and Sparrowhawk here.__init__.py
for every folder and subfolder, includingdata
folder which has .TSV files?Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
to all newly added Python files?Copyright 2015 and onwards Google, Inc.
. See an example here.try import: ... except: ...
) if not already done.PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.